UNCLASSIFIED 


AD  NUMBER 

ADB008708 

LIMITATION  CHANGES 
TO: 

Approved  for  public  release;  distribution  is 
unlimited. 


FROM: 

Distribution authorized to U.S. Gov't agencies
only; Test and Evaluation; OCT 1975. Other
requests shall be referred to Rome Air
Development Center, Griffiss AFB, NY 13441.


AUTHORITY 

RADC  ltr,  12  Apr  1978 


THIS  PAGE  IS  UNCLASSIFIED 


THIS  REPORT  HAS  BEEN  DELIMITED 
AND  CLEARED  FOR  PUBLIC  RELEASE 
UNDER DOD DIRECTIVE 5200.20 AND
NO RESTRICTIONS ARE IMPOSED UPON
ITS USE AND DISCLOSURE.

DISTRIBUTION  STATEMENT  A 

APPROVED  FOR  PUBLIC  RELEASE; 
DISTRIBUTION  UNLIMITED. 


RADC-TR-75-264

Final  Technical  Report 
October  1975 


AUTOMATIC  CLASSIFICATION  OF  LANGUAGES 


Distribution limited to U.S. Gov't agencies only;
test and evaluation; October 1975. Other requests
for this document must be referred to RADC (IRAP),
Griffiss AFB NY 13441.




Rome  Air  Development  Center 
Air  Force  Systems  Command 
Griffiss  Air  Force  Base,  New  York  13441 


This  report  has  been  reviewed  and  approved  for  publication. 


APPROVED: 


RICHARD  S.  VONUSA 
Project  Engineer 


APPROVED: 


HOWARD  DAVIS 
Technical  Director 

Intelligence  & Reconnaissance  Division 


JOHN  P.  HUSS 

Acting  Chief,  Plans  Office 


Do  not  return  this  copy.  Retain  or  destroy. 




UNCLASSIFIED 


SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE
READ INSTRUCTIONS BEFORE COMPLETING FORM

1. REPORT NUMBER
RADC-TR-75-264

4. TITLE
AUTOMATIC CLASSIFICATION OF LANGUAGES

5. TYPE OF REPORT & PERIOD COVERED
Final Technical Report
May 1974 - May 1975

7. AUTHOR(s)
R. Gary Leonard
George R. Doddington

8. CONTRACT OR GRANT NUMBER(s)
F30602-74-C-0245

9. PERFORMING ORGANIZATION NAME AND ADDRESS
Texas Instruments Incorporated
13500 North Central Expressway
Dallas TX 75222

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS
31011F
70550712

11. CONTROLLING OFFICE NAME AND ADDRESS
Rome Air Development Center (IRAP)
Griffiss AFB NY 13441

12. REPORT DATE
October 1975

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)
Same

15. SECURITY CLASS. (of this report)
UNCLASSIFIED

16. DISTRIBUTION STATEMENT (of this Report)
Distribution limited to U.S. Gov't agencies only; test and evaluation;
October 1975. Other requests for this document must be referred to RADC
(IRAP), Griffiss AFB NY 13441.

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)
Same

18. SUPPLEMENTARY NOTES
RADC Project Engineer:
Richard Vonusa (IRAP)

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)
Language Classification
Speech
Pattern Recognition

20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

Studies were made of language classification algorithms which were based on
the recurrence frequencies of sequences of phoneme-like sound segments.
Sequences of length 2, 3, 4, and 5 were considered. The sequence recurrences
comprised acceptances of acoustic hypotheses, which in turn are based on a
time-frequency scanning error measure and on occurrence time relationships.
Classification performance was estimated using 50 independent test speakers of
five languages. Individual language accuracies ranged from 17 to 90 percent,
with the overall five-language accuracy being 70 percent.




PREFACE 
This report was prepared by Texas Instruments Incorporated, Equipment Group, 13500 North Central Expressway, Dallas, Texas,
under  Contract  No.  F30602-74-C-0245  for  Rome  Air  Development  Center,  Griffiss  Air  Force 
Base,  New  York.  Mr.  Richard  S.  Vonusa  (IRAP)  was  the  RADC  Project  Engineer.  The  report 
covers  work  performed  from  May  1974  through  May  1975. 


TABLE OF CONTENTS

Section   Title

I         INTRODUCTION
II        DATA PREPROCESSING
          1. Analog Preprocessing
          2. Normalization and Quantization
III       KEY SOUND DETECTION
          1. Intersegment Similarity
          2. Time Registration
          3. Reference File Generation
             a. Scanning Error
             b. Sequence Detection
             c. Preliminary Sequence Selection
             d. Data Processing for Occurrence Information
             e. Final Reference File Formation
IV        DECISION RULES
          1. Training Data
          2. Testing Data
          3. Combination Decision Strategy
VI        CONCLUSIONS AND RECOMMENDATIONS
          REFERENCE


LIST OF ILLUSTRATIONS

Figure    Title

1         Functional Block Diagram, Analog Spectral Preprocessor
2         Spectral Representation of Speech Data From the Word "Warheads,"
          and Auxiliary Measures
3         Data Vectors Used to Compute Transitionitivity
4         Frequency Distribution of Values of E2
5         Frequency Distribution of Values of E3 (Scaled)
6         Frequency Distribution of Values of E4 (Scaled)
7         Frequency Distribution of Values of E5 (Scaled)
8         Classification Errors as a Function of Acceptance Level
          (Rules A, A*; Training Data)
9         Classification Errors as a Function of Acceptance Level
          (Rules B, B*; Training Data)
10        Classification Errors as a Function of Entropy Threshold
          (Training Data)
11        Classification Errors as a Function of Sequence Length
          (Training Data)
12        Classification Errors as a Function of Number of Retained
          References (Training Data)
13        Classification Errors as a Function of Acceptance Level
14        Classification Errors as a Function of Entropy Threshold
          (Testing Data)
15        Confusion Matrix From Use of Combination Decision Rule

LIST OF TABLES

Table     Title

I         Filter Bank Parameters
II        Components of Reduced Spectrum Data Vector
III       Numbers of Extracted Scanning Patterns
IV        Similarity Thresholds
V         Add Thresholds
VI        File Reduction Parameters
VII       Numbers of Reference Sequences in Reduced Files
VIII      Sequence Acceptance Thresholds


SECTION  I 

INTRODUCTION 

The  problem  studied  is  to  design  and  simulate  a system  that  will  automatically  determine 
to  which  of  several  specified  languages  a given  segment  of  speech  belongs,  and  to  do  this  with 
small  probability  of  error.  This  report  contains  the  results  of  the  second  phase  of  study  of  such  a 
system. In the first phase [Reference 1] classification was based on language likelihoods
computed  for  certain  reference  sounds.  Those  sounds  were  short  phoneme-like  segments.  In  the 
second  phase,  classification  is  based  on  sequences  of  several  such  segments,  which  allows  more 
accuracy  and  reliability  in  the  automatic  segmentation  process.  Another  improvement  is  the  use 
of  “time-frequency  scanning”  to  accept  or  reject  hypothesized  occurrences  of  component  sound 
segments. 

The  algorithms  discussed  here  automatically  produce  the  information  needed  for  language 
discrimination  without  reference  to  the  particular  languages  being  considered.  This  approach 
allows  treatment  of  additional  languages  with  little  additional  effort  and  with  no  need  for  special 
knowledge  of  those  languages. 

Two  separate  data  sets  of  equal  size  have  been  used  in  this  study.  The  first  is  used  to 
design  the  classifier,  and  the  second  is  used  for  estimating  the  probability  of  error  to  be  expected 
in  using  the  classifier.  The  data  sets  used  in  this  second  phase  of  the  study  are  the  same  as  those 
used  in  the  first  phase. 

A decision rule based on occurrence frequency information about sequences of length 5
and single segments allowed five-language classification accuracy of 70 percent.


SECTION II

DATA  PREPROCESSING 


Analog speech data recordings for this study were provided by RADC. The data are from
five languages, designated $L_1$, $L_2$, $L_3$, $L_4$, and $L_5$. This designation is adequate since the
algorithms process all languages identically. No processing is tailored to specific languages.
Data from 100 distinct speakers were processed: data from 50 speakers were used for
estimating decision parameters and generating reference files ("training data"); and data from 50
other speakers ("testing data") were used to provide unbiased estimates of decision accuracy.

The training data consisted of 90-second segments of speech from each of 10 speakers of
each of the five languages. The testing data comprise 90-second segments from: 10 speakers each
of $L_1$, $L_3$, and $L_5$; 6 speakers of $L_2$; and 14 speakers of $L_4$. The testing data were used only to
determine performance.

1.  ANALOG  PREPROCESSING 

The analog speech data base was preprocessed using the hardware shown functionally in
Figure 1. Fifteen bandpass filters were used to provide a time-frequency signal analysis. The
center frequencies and bandwidths of these filters are shown in Table I. Following the low-pass
filtering, the signals in each channel are sampled, digitized to 11 bits, and stored for additional
processing. One hundred samples per second are retained to represent the speech information. In
the following description, $g$ will denote a column vector of data values stored at some specific
time. Since the operations performed at each sampling time are identical, reference to the
specific time will be suppressed. The symbol $g' = [g_1\ g_2\ \cdots\ g_{15}]$ will denote the transpose of the
column vector $g$.


2. NORMALIZATION AND QUANTIZATION

Some speaker normalization is accomplished by regressing the data vector $g$ upon
regression vectors chosen to maximize the ratio of between-speaker to within-speaker variance in
the time-frequency spectrum [Reference 1]. The expression for the normalized data vector is

$$g_N = \tilde{f} / \sigma$$

where $\tilde{f}$ is the original data vector $g$ with the components along the regression vectors
subtracted out, and

$$\sigma^2 = \tilde{f} \cdot \tilde{f} = \sum_{i=1}^{15} \tilde{f}_i^2$$

TABLE I. FILTER BANK PARAMETERS

Filter    Center Frequency    Bandwidth
Number    (Hz)                (Hz)

 1          355                 220
 2          530                 220
 3          705                 220
 4          880                 220
 5         1055                 220
 6         1230                 220
 7         1405                 220
 8         1580                 220
 9         1755                 220
10         1930                 220
11         2105                 220
12         2280                 220
13         2455                 220
14         3500                1000
15         6500                3000

Figure 1. Functional Block Diagram, Analog Spectral Preprocessor


The value of $\sigma$ is used as a measure of the overall energy level of the speech. Letting $g_i$ denote
the $i$th component of the normalized data vector $g_N$, and letting $c_1$ and $c_2$ denote the regression
coefficients for the first and second order regression vectors, the $j$th component $f_j$ of the final
data vector $f$ is shown in Table II, for $j = 1, 2, \ldots, 12$. Thus, $f$ is a 12-element vector, nine
elements of which are from normalized filter outputs, along with two regression coefficients and
the overall energy measure. Following normalization, each element of the data vector $f$ is
quantized to 3 bits and stored on digital magnetic tape. Figure 2 shows data vectors from several
consecutive time samples forming a quantized and digitized speech spectrum for the spoken word
"warheads." The original spectrum, the reduced spectrum, and the auxiliary measures $c_1$, $c_2$, and
$\sigma$ are shown for this sample.

In  following  sections,  computations  will  be  made  which  use  the  data  from  certain  specified 
time  samples.  These  data  will  be  considered  as  a matrix,  each  column  of  which  is  a 12-tuple  of 
the form just defined. For example, the matrix of preprocessed data at times $t - 2$, $t - 1$, $t$,
$t + 1$, and $t + 2$ will be written

$$P = [f(t-2)\ f(t-1)\ f(t)\ f(t+1)\ f(t+2)]$$


TABLE II. COMPONENTS OF REDUCED SPECTRUM DATA VECTOR

Component    Composition

1            g1
2            1/2 (g2 + g3)
3            1/2 (g4 + g5)
4            1/2 (g6 + g7)
5            1/2 (g8 + g9)
6            1/2 (g10 + g11)
7            1/2 (g12 + g13)

Figure 2. Spectral Representation of Speech Data From the Word "Warheads," and Auxiliary Measures


SECTION  III 

KEY  SOUND  DETECTION 

The general solution of the language classification problem is to first find [...]
representation of the speech data [...] language likelihoods.

1.  INTERSEGMENT  SIMILARITY 

Extensive use is made of the squared error between two matrices (vectors) representing
sound segments. Suppose $F = [f_i(j)]$ and $G = [g_i(j)]$ each comprise M data vectors, $i = 1, 2, \ldots,
12$; $j = 1, 2, \ldots, M$. Then the squared error between F and G, written e(F,G), is defined to be

$$e(F,G) = \sum_{j=1}^{M} \sum_{i=1}^{12} [f_i(j) - g_i(j)]^2$$
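
The squared error defined above can be illustrated with a short sketch. The function name and the list-of-columns layout are assumptions made for illustration; the report does not describe how the computation was coded.

```python
def squared_error(F, G):
    """Sum of squared element differences between two equally shaped matrices.

    F and G are lists of M data vectors (columns); each data vector is a
    list of numbers (12 per vector in the report).
    """
    if len(F) != len(G):
        raise ValueError("F and G must contain the same number of data vectors")
    total = 0
    for f_col, g_col in zip(F, G):
        if len(f_col) != len(g_col):
            raise ValueError("data vectors must have equal length")
        # Accumulate [f_i(j) - g_i(j)]^2 over the components of column j.
        total += sum((a - b) ** 2 for a, b in zip(f_col, g_col))
    return total
```

A small error indicates that the two sound segments have similar spectra; identical segments give zero.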


2. TIME REGISTRATION

Points of time registration in the data are defined in terms of the overall energy measure
and a transitionitivity function, T, a real-valued function of time which reflects the magnitude
of dynamic spectral change. Let R(t) denote the 12 × 3 matrix consisting of data vectors from
three consecutive sampling times centered at time t; i.e.,

$$R(t) = [f(t-1)\ f(t)\ f(t+1)]$$

Then, the transitionitivity at time t, T(t), is defined to be the squared error between R(t - 1)
and R(t + 1):

$$T(t) = e[R(t-1), R(t+1)]$$

Figure 3 illustrates the format of the data vectors used to compute T. If T(t) is small, then the
two matrices are similar and t is a time of relatively steady-state speech. If T(t) is large, t is
a time of transition in the spectrum.

During data processing, the T function is computed at each sample time and is monitored
to determine peaks and valleys. Each peak in the T function is labeled as a time registration
point provided that (1) the value of the peak is greater than 50, and (2) the smallest value of
the overall energy measure $\sigma$ in a 0.1-second neighborhood about that time is greater than .80.
Condition (1) is imposed to overlook those small spectral changes which probably are not
phoneme boundaries, and condition (2) prevents consideration of silence segments. The threshold
levels were determined as a result of inspection of speech spectra.

Figure 3. Data Vectors Used to Compute Transitionitivity
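
A minimal sketch of the registration procedure follows. It assumes the data are held as a list of data vectors sampled 100 times per second, that the 0.1-second neighborhood spans five samples on either side of the peak, and that a peak is a sample whose T value exceeds its left neighbor and is not below its right neighbor. The function names, the default parameter values, and the peak definition are illustrative assumptions rather than details taken from the report.

```python
def squared_error(A, B):
    """Sum of squared differences between two lists of data vectors."""
    return sum((a - b) ** 2 for ca, cb in zip(A, B) for a, b in zip(ca, cb))


def transitionitivity(f, t):
    """T(t) = e[R(t-1), R(t+1)], with R(t) = [f(t-1) f(t) f(t+1)]."""
    r_before = f[t - 2 : t + 1]  # R(t-1): data vectors at t-2, t-1, t
    r_after = f[t : t + 3]       # R(t+1): data vectors at t, t+1, t+2
    return squared_error(r_before, r_after)


def registration_points(f, energy, peak_min=50, energy_min=0.80, half_window=5):
    """Return sample indices of T peaks that satisfy conditions (1) and (2)."""
    n = len(f)
    T = {t: transitionitivity(f, t) for t in range(2, n - 2)}
    points = []
    for t in range(3, n - 3):
        is_peak = T[t - 1] < T[t] >= T[t + 1]       # assumed peak definition
        lo, hi = max(0, t - half_window), min(n, t + half_window + 1)
        loud_enough = min(energy[lo:hi]) > energy_min  # condition (2)
        if is_peak and T[t] > peak_min and loud_enough:  # condition (1)
            points.append(t)
    return points
```

With a step change in an otherwise constant signal, the only registration point falls next to the step, and it disappears when the energy measure is below the threshold.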

At  each  registration  time,  a scanning  pattern  (SP)  is  extracted  and  stored  for  use  in 
representing  candidate  reference  sequences.  Let  S(t)  denote  a typical  SP  extracted  at  time  t. 
Then  S(t)  consists  of  three  derived  data  vectors 

$$S(t) = [g_1\ g_2\ g_3]$$

where

$$g_1 = \tfrac{1}{2}[f(t-2) + f(t-1)]$$
$$g_2 = \tfrac{1}{2}[f(t+1) + f(t+2)]$$
$$g_3 = g_1 - g_2$$

The  data  vectors  marked  with  asterisks  in  Figure  3 are  used  in  determining  an  SP  at  time  t. 

The first 90 seconds of speech data from each of the 50 training speakers was processed to
extract and store scanning patterns at each time registration point. Table III shows the numbers
of SPs extracted.

TABLE III. NUMBERS OF EXTRACTED SCANNING PATTERNS

Language    Number of Scanning Patterns

L1          5,399
L2          6,759
L3          5,012
L4          5,699
L5          6,345




3.  REFERENCE  FILE  GENERATION 

Any sequence of k (k = 2, 3, 4, and 5) consecutive scanning patterns separated by no more
than 0.25 second was considered to represent a potentially useful sound sequence. The total
number of these candidate reference sequences (approximately 116,000, as seen from Table III)
was too large to allow processing of the desired amount of training and testing data. Hence, the
procedures in the following sections were adopted to choose a small subset of these candidate
reference sequences for use in the final language classifier. The object of the procedures is to
find those sequences which occur often in the data and are distinct from each other. The
procedures to be described were performed for each value k, k = 2, 3, 4, and 5. Hence, specific
reference to the particular sequence length will often be omitted.
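
The enumeration of candidate reference sequences can be sketched as below. The sketch assumes registration times are expressed in samples (so 0.25 second is 25 samples at 100 samples per second) and that "separated by no more than 0.25 second" applies to each pair of adjacent scanning patterns. The function name and return format are hypothetical.

```python
def candidate_sequences(times, k, max_gap=25):
    """Return index tuples of k consecutive registration points.

    `times` holds the registration times in samples, in increasing order.
    A window qualifies when no gap between adjacent points exceeds
    `max_gap` samples (assumed here to be 25 samples = 0.25 second).
    """
    candidates = []
    for start in range(len(times) - k + 1):
        window = times[start : start + k]
        gaps = [b - a for a, b in zip(window, window[1:])]
        if all(gap <= max_gap for gap in gaps):
            candidates.append(tuple(range(start, start + k)))
    return candidates
```

Because nearly every registration point starts a qualifying window for each k, the candidate count is roughly four times the number of scanning patterns, which agrees with the figure of approximately 116,000.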


a.  Scanning  Error 

In ascertaining the occurrence of a sequence, use is made of the "scanning error," E, a
real-valued function of a fixed scanning pattern, S, and of time, t. Let F(t) denote the derived
data matrix $[g_1\ g_2\ g_3]$, where

$$g_1 = \tfrac{1}{2}[f(t-2) + f(t-1)]$$
$$g_2 = \tfrac{1}{2}[f(t+1) + f(t+2)]$$
$$g_3 = g_1 - g_2$$

and f(t) is the data vector at time t.

Note that F has the same format as the scanning pattern. Then the scanning error E(S,t) for
scanning pattern S is defined to be

$$E(S,t) = e[S, F(t)]$$

A relative  minimum  in  the  scanning  error  for  S indicates  a time  at  which  the  speech  data  is 
similar  to  the  scanning  pattern  S. 
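
The scanning error and its relative minima can be sketched as follows. The data layout (a list of data vectors indexed by sample) and the valley test (strictly lower than the left neighbor, not above the right neighbor) are assumptions; the report does not specify how ties were handled.

```python
def squared_error(A, B):
    """Sum of squared differences between two lists of data vectors."""
    return sum((a - b) ** 2 for ca, cb in zip(A, B) for a, b in zip(ca, cb))


def derived_matrix(f, t):
    """F(t) = [g1 g2 g3], built from the data vectors around sample t."""
    g1 = [(a + b) / 2 for a, b in zip(f[t - 2], f[t - 1])]
    g2 = [(a + b) / 2 for a, b in zip(f[t + 1], f[t + 2])]
    g3 = [a - b for a, b in zip(g1, g2)]
    return [g1, g2, g3]


def scanning_error(S, f, t):
    """E(S, t) = e[S, F(t)] for a fixed scanning pattern S."""
    return squared_error(S, derived_matrix(f, t))


def valleys(S, f):
    """Return (t, E(S, t)) at each relative minimum of the scanning error."""
    ts = range(2, len(f) - 2)
    E = {t: scanning_error(S, f, t) for t in ts}
    return [
        (t, E[t])
        for t in ts
        if t - 1 in E and t + 1 in E and E[t - 1] > E[t] <= E[t + 1]
    ]
```

Scanning a signal with a pattern extracted from that same signal gives a zero-error valley at the extraction time.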

b.  Sequence  Detection 

The recurrence of reference sequences is detected by first scanning the input data to
hypothesize the occurrence of appropriate time registration points, and then hypothesizing
sequence recurrence when the relationships among hypothesized time registration points
correspond to those in the reference sequence. Rejection of sequence recurrence is based on spectral
similarity between reference sequence scanning patterns and the input data.

To be more specific, consider the detection of reference sequences of length 2. Let $(S_1,
S_2)$ denote a reference pair, where $S_i$ is the scanning pattern extracted from the training data at
time $T_i$, $i = 1, 2$. Let $\Delta T$ denote $|T_2 - T_1|$, the time separation between scanning patterns, and
assume that $T_1 < T_2$. During processing of the input data, the scanning errors $E(S_i, t)$, $i = 1, 2$, are
computed and monitored to label the valleys in each of these scanning error functions. These
valley points are hypothesized to be time registration points for the corresponding scanning
pattern. Whenever a valley in $E(S_1, t)$ precedes a valley in $E(S_2, t)$ by less than 0.25 second, the
recurrence of the reference pair $(S_1, S_2)$ is hypothesized.

The basis for acceptance or rejection of this hypothesis is the pair error, denoted $E_2$. Let
$t_i$ denote the time of occurrence of a valley in $E(S_i, t)$, $i = 1, 2$, and let $\Delta t = |t_2 - t_1|$. Let
$e_i = E(S_i, t_i)$ denote the value of the scanning error at the valley time $t_i$, $i = 1, 2$. Assume that
$|t_1 - t_2| < 0.25$ second, so that the recurrence of $(S_1, S_2)$ is hypothesized. This hypothesis is
rejected if and only if $E_2 > EMAX_2$, where $EMAX_2$ is a fixed threshold and $E_2$ is defined to be

$$E_2 = \frac{(e_1 + 40)(e_2 + 40)}{2048} + \frac{4\,|\Delta t - \Delta T|}{\max(5, \Delta T)}$$

where $\Delta T = |T_2 - T_1|$ is the time separation between the reference scanning patterns.
It can be seen that, for detecting the occurrence of a reference pair $(S_1, S_2)$, the time separation
of registration times in the data must be close to that of the scanning patterns, and also the
scanning patterns must each be similar to corresponding transitions found in the data.

For detection of sequences of length k, k = 3, 4, and 5, the scanning error valleys for each
of the k scanning patterns are recorded, and an occurrence hypothesized whenever (1) k scanning
error valleys, one from each scanning pattern, occur in the same order as did the scanning
patterns; and (2) no time interval between adjacent valleys is greater than 0.25 second. The pair
errors between each of the k - 1 pairs of valleys are summed to form a k-sequence error, $E_k$.
The hypothesized occurrence of the sequence is rejected if and only if $E_k > EMAX_k$, where
$EMAX_k$ is a specified threshold for sequences of length k.
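
The pair error and the k-sequence error can be sketched as below. The sketch assumes that the spectral term and the timing term of the pair error are added, and that time separations are measured in samples (so the constant 5 corresponds to 0.05 second at 100 samples per second). Both points are readings of the printed formula rather than confirmed details, and all function names are hypothetical.

```python
def pair_error(e1, e2, dt_data, dt_ref):
    """Pair error E2 for one adjacent pair of scanning-error valleys.

    e1, e2   -- scanning errors at the two valley times
    dt_data  -- separation of the valleys in the input data (samples)
    dt_ref   -- separation of the scanning patterns in the reference (samples)
    """
    spectral = (e1 + 40) * (e2 + 40) / 2048
    timing = 4 * abs(dt_data - dt_ref) / max(5, dt_ref)
    return spectral + timing  # additive combination is an assumption


def sequence_error(errors, data_times, ref_times):
    """E_k: sum of the k-1 pair errors between adjacent valleys."""
    total = 0.0
    for i in range(len(errors) - 1):
        dt_data = abs(data_times[i + 1] - data_times[i])
        dt_ref = abs(ref_times[i + 1] - ref_times[i])
        total += pair_error(errors[i], errors[i + 1], dt_data, dt_ref)
    return total


def accept(errors, data_times, ref_times, emax):
    """A hypothesized occurrence is rejected if and only if E_k > EMAX_k."""
    return sequence_error(errors, data_times, ref_times) <= emax
```

For a triple, the two pair errors are summed, so a timing mismatch in either adjacent pair raises the overall sequence error.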


c.  Preliminary  Sequence  Selection 


Because the average processing time needed to detect occurrences of a single reference
sequence in all the training and testing data is approximately 20 minutes, most of the
approximately 116,000 candidate reference sequences must be eliminated from consideration. As
a first step toward this goal, a study was made to determine what constitutes similarity among
candidate reference sequences. Let $P_1 = (S_{11}, S_{12})$ and $P_2 = (S_{21}, S_{22})$ denote candidate
reference sequences, where $S_{ij}$ is a scanning pattern which occurred at time $t_{ij}$, $i, j = 1, 2$. Let $e_1$
and $e_2$ denote the squared errors $e(S_{11}, S_{21})$ and $e(S_{12}, S_{22})$, respectively, and define $\Delta t_1 = |t_{12}
- t_{11}|$ and $\Delta t_2 = |t_{22} - t_{21}|$. Then define the similarity between $P_1$ and $P_2$ to be

$$E_s(P_1, P_2) = \frac{(e_1 + 40)(e_2 + 40)}{2048} + \frac{4\,|\Delta t_1 - \Delta t_2|}{\max(5, \Delta t_2)}$$

Note the intentional analogy to the pair error $E_2$.

Let $T_1 = \{S_i : i = 1, 2, \ldots, 100\}$ denote a set containing the first 100 candidate
reference sequences of length k hypothesized from the first training speaker of language $L_2$. A
cumulative frequency distribution was plotted for the values of the similarities $E_s(S_i, S_j)$, $1 \le i <
j \le 100$. A similarity level $\beta$ was determined such that $P\{E_s(S_i, S_j) < \beta\} = 0.75$. The values of $\beta$
determined for each k are shown in Table IV.

TABLE IV. SIMILARITY THRESHOLDS

Sequence Length     2     3     4     5
Threshold          49    92   132   176


Then, all the candidate reference sequences from each language were ordered according to
the following procedure. Let the language and the sequence length be fixed. A subset $T_2$
consisting of 10 percent of the candidate reference sequences from each of the 10 training
speakers was formed. For every candidate reference sequence R from the language considered,
the similarities $E_s(R, S)$, $S \in T_2$, were computed, and the number N(R) of similarity values less
than $\beta$ (for sequence R) was noted. All candidate reference sequences were then ordered
according to N, with the first sequence being the one with the highest value of N. This
procedure places those sequences occurring more frequently higher in the ordering.
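
The ordering step can be sketched in a few lines. The similarity measure is passed in as a function so that the sketch stays independent of how sequences are stored; the function name and argument layout are hypothetical.

```python
def order_by_recurrence(candidates, subset, similarity, beta):
    """Sort candidates so that those with the most near matches come first.

    For each candidate, count the members of `subset` whose similarity
    value falls below `beta`; a low value means a close match.
    """
    def near_match_count(candidate):
        return sum(1 for s in subset if similarity(candidate, s) < beta)

    # sorted() is stable, so candidates with equal counts keep their order.
    return sorted(candidates, key=near_match_count, reverse=True)
```

In this toy usage, integers stand in for sequences and the absolute difference stands in for the similarity measure:

```python
ordered = order_by_recurrence(
    [0, 10, 11, 12, 50], [10, 11, 12, 50], lambda a, b: abs(a - b), beta=2
)
```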



A reduced  file  of  candidate  reference  sequences  is  formed  for  each  language  and  each 
sequence  length  by  first  placing  in  the  file  that  sequence  which  is  first  in  the  above  ordering,  and 
then  adding  succeeding  sequences  in  the  ordering,  provided  that  the  minimum  of  the  similarities 
between  the  sequence  to  be  added  and  each  sequence  already  in  the  reduced  file  is  greater  than  a 
fixed  threshold.  This  ensures  that  each  sequence  in  the  reduced  file  is  distinct  from  all  others  in 
the  file. 
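
The construction of the reduced file amounts to a greedy selection over the ordered candidates, sketched below. The `limit` parameter is a hypothetical stand-in for the cap imposed by processing time and core storage, and the similarity measure is again passed in as a function.

```python
def reduce_file(ordered, similarity, add_threshold, limit=None):
    """Greedily keep candidates that are distinct from all kept so far.

    A candidate is added only when its similarity value to every sequence
    already in the reduced file exceeds `add_threshold`, i.e. when the
    minimum of those values is greater than the threshold.
    """
    reduced = []
    for candidate in ordered:
        if limit is not None and len(reduced) >= limit:
            break
        if all(similarity(candidate, kept) > add_threshold for kept in reduced):
            reduced.append(candidate)
    return reduced
```

The first sequence in the ordering is always kept, because the reduced file is empty when it is considered.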

The  procedure  for  determining  the  add  threshold  is  as  follows.  A small  file  of  candidate 
reference  sequences  from  34  consecutive  scanning  patterns  from  one  training  speaker  was 
formed.  Occurrences  of  these  sequences  were  detected  in  10  seconds  of  data  from  each  of  the 
nine  other  training  speakers  of  the  language.  The  sequence  rejection  threshold  EMAXk  was  set 
large  enough  so  that  all  hypothesized  sequences  and  their  corresponding  Ek  values  are  recorded. 
The cumulative frequency distribution of the values of $E_k$ was plotted and the add threshold $\alpha$
was chosen such that $P\{E_k < \alpha\} = 0.98$. Table V shows the add thresholds for each language
and each sequence length.
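
Reading a threshold off an empirical cumulative distribution can be sketched as follows. The report obtained the value from a plotted distribution; the rank-based rule here (the smallest observed value with at least the requested fraction of observations at or below it) is an assumption, as is the function name.

```python
import math


def threshold_at_fraction(values, fraction):
    """Smallest observed value v with at least `fraction` of values <= v."""
    if not values:
        raise ValueError("at least one observation is required")
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]
```

Called with a fraction of 0.98 on the recorded sequence errors, this gives an add threshold in the sense described above; a fraction of 0.75 on the pairwise similarities gives a similarity level in the same way.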

The total numbers of hypothesized sequences from the experiment described in the
preceding paragraph are shown in Table VI for each language and each sequence length. These
numbers were used to determine the proportions of the desired numbers of reference sequences
of the various lengths. These proportions are the inverses of the ratios existing among the total
numbers of hypothesized sequences. That is, if there were twice as many pairs hypothesized as
there were triples, it is desired to have half as many reference pairs as reference triples in the
reduced file. The total number of reference sequences added to the reduced file was restricted
by available processing time and computer core storage limitations. The numbers of reference
sequences retained in the reduced files are shown in Table VI for each language and each
sequence length. The total numbers of reference sequences of each length are shown in Table VII.
A total of 452 reference sequences remained in the reduced file, denoted F.

TABLE V. ADD THRESHOLDS

                 Sequence Length
Language       2      3      4      5

L1            94    152    204    256
L2            96    156    210    264
L3           104    168    228    296
L4            84    140    186    224
L5            92    148    214    248

TABLE VI. FILE REDUCTION PARAMETERS

Language    Sequence    Number of Hypothesized    Number of Sequences
Index       Length      Sequences                 Retained

1           2           10,817                    10
1           3            5,384                    21
1           4            2,810                    42
1           5            1,447                    19
2           2           14,846                    10
2           3            7,625                    21
2           4            4,02?                    40
2           5            2,164                    18
3           2           13,574                     7
3           3            5,678                    15
3           4            1,646                    45
3           5              532                    21
4           2           18,880                    14
4           3           11,833                    23
4           4            7,260                    37
4           5            4,027                    19
5           2           12,243                     9
5           3            5,715                    20
5           4            2,670                    41
5           5            1,249                    19

TABLE VII. NUMBER OF REFERENCE SEQUENCES IN REDUCED FILES

Sequence Length         2      3      4      5    Total
Number of Sequences    51    100    205     96      452

d. Data Processing for Occurrence Information

Both the training data and the testing data (90 seconds of speech from each of 100
speakers) were processed to detect the occurrences of each sequence in the reference file F. The
values of $EMAX_k$ (the rejection level for hypothesized sequences) were set large enough that all
hypothesized sequences were accepted. For each accepted sequence, record was made of (1) the
index S of the speaker whose data was being processed (and, hence, of his language L), (2) the
index R of the accepted reference sequence (and, hence, its length k), and (3) the value $E_k$ of
the overall sequence error. This processing required approximately 150 hours of computer time,
using a TI 980A minicomputer.

To determine the effects of varying the rejection level for detecting sequence recurrence,
six sets of thresholds $EMAX_k$ were used in turn to determine an array N(R, S, L), where an
entry in this array is the number of occurrences of reference sequence R during processing of
data from speaker S of language L. (As previously described, an occurrence is counted whenever
$E_k \le EMAX_k$.) One set of thresholds (Case 0) is the one mentioned in the previous paragraph,
which yields all hypothesized sequences. The other sets contain successively lower thresholds,
thereby requiring successively better match between reference and data to yield an occurrence.
Case i, i = 1, 2, ..., 5, are obtained as follows. First, an empirical probability density plot was
obtained for values of $E_k$ from training data for Case 0. Figures 4, 5, 6, and 7 show this density
for sequence length 2, 3, 4, and 5, respectively. Let $N_T$ denote the total number of sequences
hypothesized (for some sequence length). From the corresponding distribution, threshold values
were determined which would yield $N_T/2^i$ detected sequences, for Case i, i = 1, 2, 3, 4, 5. This
procedure was followed for each value of sequence length. The resulting thresholds are shown in
Table VIII. Analysis of the results of using the six cases is presented in Section V.
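
The choice of thresholds for Cases 1 through 5 can be sketched as a rank lookup in the sorted Case 0 sequence errors. The report read the values from an empirical density plot, so the exact rule below (take the error value at rank $N_T/2^i$, rounding down) is an assumption made for illustration.

```python
def halving_thresholds(errors, cases=5):
    """Thresholds that retain about N_T / 2**i of the hypothesized sequences.

    `errors` holds the sequence errors E_k recorded for Case 0 (every
    hypothesized sequence accepted) for one sequence length.
    """
    ordered = sorted(errors)
    n_total = len(ordered)
    thresholds = []
    for i in range(1, cases + 1):
        keep = max(1, n_total // 2 ** i)       # target number of detections
        thresholds.append(ordered[keep - 1])   # largest error still accepted
    return thresholds
```

Each successive case halves the number of accepted occurrences, so the thresholds decrease from Case 1 to Case 5, as they do in Table VIII.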


TABLE VIII. SEQUENCE ACCEPTANCE THRESHOLDS

                        Case
Length      0       1      2      3      4      5

2         999*     29     18     13      9      7
3         999      61     41     31     24     19
4         999      96     69     53     42     34
5         999     119     87     69     56     46

*Threshold of 999 allows acceptance of every hypothesized sequence.

e. Final Reference File Formation

The  final  reference  file,  F*,  of  sequences  to  be  used  in  computing  decision  functions  is 
formed  by  deleting  from  F those  reference  sequences  which  had  too  little  language  specificity. 
Such  sequences  were  determined  by  considering  the  average  information  remaining  (uncertainty, 
entropy)  after  detection  of  a reference  sequence  in  the  training  data.  The  lower  this  uncertainty, 
the  better  is  the  language  discrimination  capability  of  that  sequence.  Specifically,  the  entropy 
H(R)  associated  with  the  detection  of  sequence  R is 

$$H(R) = -\sum_{i=1}^{5} p(L_i|R) \log [p(L_i|R)]$$

where $p(L_i|R)$ is the language likelihood, given that R has occurred. Let the symbol $S_i$ denote
the collection of training speakers of language $L_i$, $i = 1, 2, \ldots, 5$, and let M(R, L) denote the
number of detections of reference sequence R during speech from all training speakers of
language L. Then

$$M(R, L_i) = \sum_{S \in S_i} N(R, S, L_i), \quad i = 1, 2, \ldots, 5$$

The likelihood $p(L_i|R)$ is computed as

$$p(L_i|R) = \frac{q(R, L_i)}{\sum_{j=1}^{5} q(R, L_j)}$$

where

$$q(R, L_i) = \frac{M(R, L_i)}{\sum_{R \in F} M(R, L_i)}$$

The final file F* is then

$$F^* = \{R \in F : H(R) < H_0\}$$

where $H_0$ is a fixed entropy threshold. Base 2 logarithms were used in computations; the
maximum entropy possible is $\log_2 5 = 2.322$.
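
The entropy screening can be sketched as follows. The detection counts are assumed to be held in a dictionary that maps each reference sequence identifier to its list of per-language counts M(R, L_i). That layout, the function names, and the decision to drop references with no detections at all are illustrative assumptions.

```python
import math


def language_likelihoods(counts_for_r, totals):
    """p(L_i|R) from the counts M(R, L_i) and the per-language totals."""
    q = [m / total if total else 0.0 for m, total in zip(counts_for_r, totals)]
    q_sum = sum(q)
    return [x / q_sum for x in q] if q_sum else None


def entropy(probs):
    """Base 2 entropy; terms with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def final_file(M, h0):
    """Keep the references whose entropy H(R) is below the threshold h0."""
    n_lang = len(next(iter(M.values())))
    # Denominator of q(R, L_i): detections of all references in language i.
    totals = [sum(counts[i] for counts in M.values()) for i in range(n_lang)]
    kept = []
    for ref, counts in M.items():
        probs = language_likelihoods(counts, totals)
        if probs is not None and entropy(probs) < h0:
            kept.append(ref)
    return kept
```

A reference detected in only one language has zero entropy and is always retained, while one detected at equal normalized rates in all five languages reaches the maximum of about 2.322 bits and is removed for any useful threshold.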
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Figure  4.  Frequency  Distribution  of  Values  of  Ej 
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Figure  6.  Frequency  Distribution  of 
Values  of  E4  (Scaled) 
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NOTE  2 OCCURRENCES  GREATER  THAN  115 
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Figure  7.  Frequency  Distribution  of 
Values  of  Es  (Scaled) 
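The reference-selection step of the preceding subsection (detection counts, likelihoods, entropy, and the cut at H_0) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary layout of the counts M(R, L_i), and the toy numbers are hypothetical and do not come from the report.

```python
import math

def language_likelihoods(counts, totals):
    """p(L_i | R) from counts[i] = M(R, L_i) and totals[i] = sum of M(R', L_i) over F."""
    q = [m / t if t > 0 else 0.0 for m, t in zip(counts, totals)]
    z = sum(q)
    # Assumption: a reference never detected gets a uniform likelihood.
    return [x / z for x in q] if z > 0 else [1.0 / len(q)] * len(q)

def entropy_bits(p):
    """Base-2 entropy; terms with p = 0 contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def final_reference_file(M, h0):
    """Keep references whose entropy H(R) is below the threshold h0.

    M maps a reference id to its list of per-language detection counts.
    Returns a dict of reference id -> likelihood list for the kept references.
    """
    n_lang = len(next(iter(M.values())))
    totals = [sum(row[i] for row in M.values()) for i in range(n_lang)]
    kept = {}
    for ref, row in M.items():
        p = language_likelihoods(row, totals)
        if entropy_bits(p) < h0:
            kept[ref] = p
    return kept

# Toy counts (hypothetical): "a" occurs in one language only,
# "b" is spread evenly, "c" is spread over four languages.
M = {"a": [6, 0, 0, 0, 0], "b": [2, 2, 2, 2, 2], "c": [0, 6, 6, 6, 6]}
kept = final_reference_file(M, 2.3)
```

With a threshold of 2.3 the evenly spread reference "b" (entropy log2 5) is rejected, while "a" (entropy 0) and "c" (entropy 2) are retained.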


SECTION  IV 
DECISION  RULES 


This section describes the decision rules used to classify data from test speakers. The
general approach is to process a specified amount (90 seconds) of speech data to be classified
(for example, from speaker S), detecting the occurrences of the references in F*. Let
R = {R_n} denote the sequence of detected reference sequences of a specified length k; R
is taken to be the representation of the speech data from speaker S. Then the language
likelihoods described above and the occurrence statistics of the sequence R are used to compute
decision parameters which are used to classify the data as being from one of the five languages
considered. Implementation of the decision strategies was carried out separately for each value of
sequence length k, k = 2, 3, 4, and 5.


Let p_i(R) denote the probability density function for the sequence R, given that language
L_i was spoken, i = 1, 2, 3, 4, 5. Letting P(L_i) denote the a priori probability for language L_i, the
decision rule which is optimum in that it incurs the lowest possible probability of misclassification
can be stated as: observe the sequence R and choose the language L_j for which

$$P(L_j)\, p_j(R) \ge P(L_i)\, p_i(R) \quad \text{for } i = 1, 2, 3, 4, 5$$

In  practice,  neither  the  a priori  language  probabilities  nor  the  conditional  sequence  densities  are 
explicitly  known.  Hence,  approximations  to  the  optimum  rule  must  be  used,  and  less  than 
optimum  results  must  be  tolerated. 
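As a small illustration of the optimum rule stated above, the following Python sketch picks the language index that maximizes the product of prior and conditional density. The function name and all numbers are hypothetical; in the study itself neither quantity was known, which is why approximations were needed.

```python
def map_decision(priors, densities):
    """Return the index j that maximizes P(L_j) * p_j(R)."""
    scores = [prior * dens for prior, dens in zip(priors, densities)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical density values p_i(R) for one observed sequence R.
densities = [0.01, 0.05, 0.02, 0.03, 0.04]

equal_priors = [0.2] * 5
skewed_priors = [0.6, 0.1, 0.1, 0.1, 0.1]

choice_equal = map_decision(equal_priors, densities)
choice_skewed = map_decision(skewed_priors, densities)
```

With equal priors the rule reduces to choosing the largest density; a strongly skewed prior can change the decision, as the second call shows.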

The basic strategy is to assume independence of the detected sequences of length k and
then choose the language which maximizes the resultant expression for the log likelihood of the
hypothesized languages given the observation of the data representation R. Let DF_1(S, L) denote
the unnormalized decision function value computed for test speaker S and hypothesized language
L ∈ {L_1, L_2, L_3, L_4, L_5}. Then

$$DF_1(S, L) = \sum_{R_n \in R} \log p(L \mid R_n)$$


where the summation is taken over reference sequences in the data representation R for speaker
S. The following normalization is made. Define

$$DF_2(S, L) = \frac{DF_1(S, L)}{\sum_{S'} DF_1(S', L)}$$


where  the  summation  is  over  all  50  test  speakers.  Then  let  DF(S,  L)  denote  the  normalized 
decision  function  used  to  classify  the  test  speakers,  and  define 


$$DF(S, L) = \frac{DF_2(S, L)}{\sum_{i=1}^{5} \{DF_2(S, L_i)\}^2}$$


Decision rule A is to choose, for speaker S, the language L ∈ {L_1, L_2, L_3, L_4, L_5} for which
DF(S, L) is smallest.
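Decision rule A can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the data layout, the function names, and the toy likelihoods are hypothetical, the example uses three languages instead of five for brevity, and the floor `eps` applied before taking logarithms is an assumption of this sketch (the report does not say how zero likelihoods were handled). The final normalization uses a sum of squares in the denominator.

```python
import math

def df1(detected, likelihoods, eps=1e-6):
    """Unnormalized decision function: sum of log p(L_i | R) over detections.

    eps floors zero likelihoods (an assumption of this sketch).
    """
    n_lang = len(next(iter(likelihoods.values())))
    return [sum(math.log(max(likelihoods[r][i], eps)) for r in detected)
            for i in range(n_lang)]

def rule_a(detections, likelihoods):
    """detections maps speaker -> list of detected reference ids.

    Returns speaker -> index of the language with the smallest DF(S, L),
    or None for a speaker with no usable detections.
    """
    raw = {s: df1(d, likelihoods) for s, d in detections.items()}
    n_lang = len(next(iter(likelihoods.values())))
    # Normalizer for DF_2: sum of DF_1 over all speakers, per language.
    col = [sum(raw[s][i] for s in raw) for i in range(n_lang)]
    decisions = {}
    for s, row in raw.items():
        df2 = [row[i] / col[i] if col[i] else 0.0 for i in range(n_lang)]
        norm = sum(x * x for x in df2)
        if norm == 0:
            decisions[s] = None
            continue
        df = [x / norm for x in df2]
        decisions[s] = min(range(n_lang), key=df.__getitem__)
    return decisions

# Toy likelihood table p(L_i | R) for three references (hypothetical values).
likelihoods = {
    "r1": [0.7, 0.2, 0.1],
    "r2": [0.1, 0.8, 0.1],
    "r3": [0.2, 0.2, 0.6],
}
detections = {
    "s1": ["r1", "r1", "r3"],
    "s2": ["r2", "r2", "r1"],
    "s3": ["r3", "r3", "r2"],
}
decisions = rule_a(detections, likelihoods)
```

Because the DF_1 values are sums of logarithms of probabilities, they are negative, so DF_2 is positive and its smallest value marks the language with the largest summed log likelihood. The last normalization divides all of a speaker's scores by the same positive constant, so it does not change the choice made by rule A.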


A second strategy is based on maximizing the correlation between the decision function
values defined above for the test speaker and normalized decision function values. Let L_i and L_j
denote actual language and hypothesized language, respectively, i, j = 1, 2, 3, 4, 5. Let

$$D_1(L_i, L_j) = \sum_{S \in S_i} DF(S, L_j)$$

and then define

$$D(L_i, L_j) = \frac{D_1(L_i, L_j)}{\sum_{m=1}^{5} \{D_1(L_i, L_m)\}^2}$$


Decision rule B states: Compute the correlation

$$p^*(S, L_i) = \sum_{j=1}^{5} D(L_i, L_j)\, DF(S, L_j)$$

and choose, for speaker S, that language L ∈ {L_1, L_2, L_3, L_4, L_5} such that
p(S, L) = 1 - p*(S, L) is smallest.
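A rough Python sketch of decision rule B is given below. It assumes that the per-language reference patterns D(L_i, L_j) are accumulated from speakers whose language is known and are normalized by their sum of squares; both points are one reading of the definitions above. The names `templates`, `rule_b`, and `language_of`, and all numbers, are hypothetical, and three languages are used instead of five to keep the example short.

```python
def templates(df_by_speaker, language_of):
    """Build D(L_i, .) by summing DF(S, .) over the speakers of language L_i."""
    n_lang = len(next(iter(df_by_speaker.values())))
    d1 = [[0.0] * n_lang for _ in range(n_lang)]
    for speaker, df in df_by_speaker.items():
        i = language_of[speaker]
        for j in range(n_lang):
            d1[i][j] += df[j]
    out = []
    for row in d1:
        norm = sum(x * x for x in row)
        out.append([x / norm for x in row] if norm > 0 else row)
    return out

def rule_b(df, D):
    """Choose the language index minimizing 1 - sum_j D[i][j] * df[j]."""
    scores = [1.0 - sum(D[i][j] * df[j] for j in range(len(df)))
              for i in range(len(D))]
    return min(range(len(scores)), key=scores.__getitem__)

# Hypothetical normalized decision values for three labelled speakers.
df_by_speaker = {
    "a": [0.2, 0.5, 0.5],
    "b": [0.5, 0.2, 0.5],
    "c": [0.5, 0.5, 0.2],
}
language_of = {"a": 0, "b": 1, "c": 2}
D = templates(df_by_speaker, language_of)

choice_1 = rule_b([0.25, 0.45, 0.55], D)
choice_2 = rule_b([0.5, 0.5, 0.2], D)
```

With this normalization, a pattern correlated with the sum it was built from gives exactly 1, so 1 - p* is near zero when a test speaker's decision values resemble the pattern of a language.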


Each of these decision rules was implemented for each of four sequence lengths k = 2, 3, 4,
and 5. Additional decision rules were used which were based on results for the four sequence
lengths combined. To exhibit the dependence of the decision parameters on sequence length, k,
define

$$DF'(S, L, k) = DF(S, L)$$

for sequence length k = 2, 3, 4, and 5. Then define

$$DF_C(S, L) = \sum_{k=2}^{5} DF'(S, L, k)$$

Decision rules A* and B* result from using DF_C instead of DF_2 and then making the succeeding
computations in the same manner as for rules A and B, respectively.


SECTION V

CLASSIFICATION RESULTS

This section describes the classification results obtained using the various decision rules
described in Section IV. Many classification experiments were performed to determine the exact
structure of a classifier which performs well with the training data. The idea is to design the
classifier using the training data and then classify the test speakers to estimate the probability of
correct classification associated with that classifier. Variables that required specification were:
(1) acceptance level for hypothesized sequences; (2) entropy threshold for selection of reference
sequences with sufficient language specificity; (3) decision strategy; and (4) sequence length.

1.  TRAINING  DATA 

Figures 8 through 12 show the numbers of errors from classification of the 50 training
speakers as functions of the variables mentioned above. Figures 8 and 9 show (for decision rules
A, A* and B, B*, respectively) the errors as a function of the acceptance level for acceptance of
the hypothesis that a reference sequence has recurred. The cases for the various levels were
described in Subsection III.3.d and were labeled 0, 1, 2, 3, 4, 5, as in the figures. These cases
correspond to 100-, 50-, 25-, 12.5-, 6.25-, and 3.125-percent acceptance of all hypothesized
sequences. These figures include results for each sequence length k = 2, 3, 4, 5, and for each result
plotted, the entropy threshold used was H0 = 2.3.
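The six acceptance cases halve the accepted fraction at each step, and each case has one threshold per sequence length in Table VIII. A small Python lookup makes the correspondence explicit. The threshold values are copied from Table VIII; the names `THRESHOLDS`, `acceptance_percent`, and `threshold` are hypothetical.

```python
# Sequence acceptance thresholds from Table VIII, indexed by sequence length;
# list position is the case number 0 through 5.
THRESHOLDS = {
    2: [999, 29, 18, 13, 9, 7],
    3: [999, 61, 41, 31, 24, 19],
    4: [999, 96, 69, 53, 42, 34],
    5: [999, 119, 87, 69, 56, 46],
}

def acceptance_percent(case):
    """Percentage of hypothesized sequences accepted for a given case."""
    return 100.0 / 2 ** case

def threshold(length, case):
    """Acceptance threshold for a sequence length (2-5) and case (0-5)."""
    return THRESHOLDS[length][case]

percents = [acceptance_percent(c) for c in range(6)]
case3_length4 = threshold(4, 3)
```

For example, Case 3 corresponds to 12.5-percent acceptance, and for sequences of length 4 its threshold in Table VIII is 53.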

Figure 10 shows the errors incurred as a function of the entropy threshold for selection of
reference sequences. This parameter determines the total numbers of references of each length
remaining in the file used to compute decision function values. In each case shown, data from
the 12.5-percent acceptance case was used.

Figure 11 shows the errors as a function of sequence length for the case: (1) H0 = 2.3,
(2) 12.5-percent acceptance, and (3) decision rules A and B. This case was chosen because results
for it were at least as good as for the other situations. For rule A of this same case, Figure 12
shows error as a function of the number of reference sequences remaining in the final reference
file.


It  is  seen  that,  for  the  training  speakers,  Case  3 data  (12.5-percent  acceptance),  an  entropy 
threshold  of  H0  = 2.3,  and  decision  rule  A using  sequences  of  length  4 yielded  the  best 
classification  performance:  88-percent  correct  five-language  classification. 

2.  TESTING  DATA 

Figures 13 and 14 show the numbers of errors resulting from classifying the 50 test
speakers for various decision rules and values of the parameters. The parameters determined from
the training data experiments provided the basis for choosing parameters for classification
experiments with the testing data. Figure 13 shows performance as a function of acceptance level
and Figure 14 shows performance as a function of entropy threshold. It is seen that Case 3 data
and an entropy threshold of H0 = 2.3 yield the fewest errors provided that decision rule B was
used with sequences of length 5. This choice yielded 62-percent correct five-language
classification.

3. COMBINATION DECISION STRATEGY


A decision rule was formulated which used sequences of length k = 5 as well as single
segments, as developed in the first phase of this study. During the first phase, values of a
decision function D'(S, L) were determined (Table XI, Reference 1) which were computed from
the same test data. A decision function which combined the results of the two phases of this
study was computed (for i = 1, 2, 3, 4, 5) to be:


$$D(S, L_i) = \frac{p(S, L_i)}{\sum_{j} p(S_j, L_i)} \, \frac{D'(S, L_i)}{\sum_{j} D'(S_j, L_i)}$$


where p(S, L_i) was computed for sequence length k = 5. Choosing the language which yielded the
smallest value for D(S, L) to classify the test speaker S yielded the confusion matrix shown in
Figure 15. The overall classification accuracy resulting from this combination rule is 70 percent.


Figure 10. Classification Errors as a Function of Entropy Threshold (Training Data)


Figure 12. Classification Errors as a Function of Number of
Retained References (Training Data)

Figure 14. Classification Errors as a Function of Entropy Threshold (Testing Data)


SECTION  VI 

CONCLUSIONS  AND  RECOMMENDATIONS 


In this study language classification was based on sequences of 1, 2, 3, 4, and 5
phoneme-like sound segments. The approach taken treated each language identically, without
special linguistic considerations. Time-frequency scanning was used to hypothesize time
registration points and generate candidate reference sequences. Relationships among occurrence times,
speech energy, and scanning errors were used to hypothesize the recurrence of reference
sequences in input speech data. Classification was based on summed logarithms of the language
likelihood estimates, given the occurrences of the reference sequences.


Sequences of length 4 performed best in classifying training speakers. For this best case, an
entropy threshold of 2.3 provided the best rejection of sequences not having sufficient language
specificity, and the acceptance threshold was set such that 12.5 percent of all hypothesized
sequences were considered. In classifying the 50 training speakers, 88-percent correct
five-language classification resulted.

A decision rule using sequences of length 5 in combination with sequences of length 1
(single segments) gave the best performance in classifying the test speakers: 70-percent
correct five-language classification. Again, the 2.3 entropy rejection level and 12.5-percent
acceptance level for hypothesized sequences (as predicted from training data results) proved most
useful when the independent test data was classified.


Speaker dependence proved to be a formidable obstacle in attaining good classification
results. The same nine test speakers (18 percent of the test data base) were misclassified by both
the decision rule B using sequences of length 5 and the decision rule using D'(S, L) for single
segments. To reduce such speaker dependency, the following improvements are planned.

A voiced-data  indicator  and  a pitch  measure  should  be  included  in  the  spectral  data 
representation. 

Labeling of sequences as to basic sound type (e.g., stop, fricative, vowel, consonant)
should provide better sequence classification as well as less speaker
dependency.

To  allow  averaging  the  effects  of  individual  speakers,  separate  representation  of 
overall  sequence  data  should  be  defined  and  used. 


There should also be significant improvement in data processing throughput to allow more
detailed understanding, analysis, and refinement of reference sequence files.